Abstract


In Part 1 of this paper we presented the architecture of NETRA [1]. This paper presents a performance
evaluation of NETRA using several common vision algorithms. The performance of algorithms mapped onto a
single cluster is described first. It is shown that SIMD, MIMD, and systolic algorithms can be easily
mapped onto processor clusters, and that almost linear speedups are possible. For some algorithms,
analytical performance results are compared with implementation results, and the analysis is observed
to be very accurate. A performance analysis of parallel algorithms mapped across clusters is then
presented. Mappings across clusters illustrate the importance and use of shared as well as distributed
memory in achieving high performance. The evaluation parameters are derived from the characteristics of
the parallel algorithms, and these parameters are used to evaluate the alternative communication
strategies in NETRA. Furthermore, the effect of communication interference from other processors in the
system on the execution of an algorithm is studied. Using this analysis, the performance of many
algorithms with different characteristics is presented. It is observed that if communication speeds are
matched with computation speeds, good speedups are possible when algorithms are mapped across clusters.


This research was supported in part by the National Aeronautics and Space Administration under Contract NASA NAG-1-613.

1. Introduction


In the first part of this paper [1], we described the architecture of NETRA and its features, and presented
an analysis of inter-cluster communication strategies. This paper presents a performance evaluation of NETRA
using several common image processing and vision algorithms. The performance of the components of NETRA is
illustrated using algorithms with varying characteristics and communication requirements. The results are
used to identify the bottlenecks of NETRA and to suggest methods to improve and refine the architecture.

For each algorithm we present one or more mapping strategies, its performance evaluation, and a discussion
of the results. The algorithms include the two-dimensional (2-D) FFT, convolution, separable convolution,
the Hough transform, Sobel edge detection, and median filtering. Some of the algorithms are part of the
Image Understanding Benchmark presented in [2]. The approach to performance evaluation has been described
in the first paper. To evaluate parallel algorithms on a cluster, we explore alternative mapping strategies
and computation modes. Some of the algorithms have been implemented on a simulated cluster, and we show that
the analysis provides very accurate results. We also discuss the performance of the algorithms when they are
mapped across multiple clusters. The results are used to compare alternative inter-cluster communication
strategies.

Table 1 lists the parameters used for performance evaluation unless specified otherwise. The values of
computation and communication speeds are chosen to be conservative: much faster processors and
communication links are available with current technology, so the performance results presented in this
paper are conservative. We have chosen these parameters because the goal is to study the behavior of the
architecture rather than to present raw performance numbers.

This paper is organized as follows. Section 2 contains the analysis and performance of various algorithms
on one cluster. We show mappings in SIMD as well as MIMD modes. In Section 3 we present implementation
results for four algorithms, two of which (median filtering and Sobel edge detection) are part of the Image
Understanding Benchmark [2], and note that the analytical results are comparable to the implementation
results. Section 4 discusses the performance of algorithms when they are mapped across multiple clusters.
We present results for a multistage interconnection network between cluster processors as well as for a
global bus interconnection. Finally, Section 5 presents a summary. For further details, the reader is
referred to [3].


Table 1 : Parameters for Performance Evaluation

Parameter                                     Value
Total No. of Processors                       512
Cluster Size                                  8-128
No. of Processors/Port (P_p)                  4
Image Size (NxN)                              512x512
Memory Modules (M)                            128
Processor Speed                               5 MIPS, 5 MFLOPS
Network Speed (Block Transfer)                20 Mbytes/sec
Traffic Intensity for Interference (m x r)    0.1, 0.4, 0.8

2. Parallel Algorithms on a Cluster 

In this section we describe how various algorithms with different characteristics can be mapped onto a
processor cluster. The algorithms are classified according to their computation and communication
characteristics as well as their suitability for implementation on SIMD or MIMD architectures. The purpose
of this section is threefold: first, to illustrate how SIMD, systolic, and MIMD type algorithms can be
efficiently mapped onto the cluster; second, to show how algorithms that need both types of computation can
be mapped; and finally, to show how algorithms from different classes can be mapped onto the cluster. In
the evaluation we discuss the computation, communication, and storage requirements of the algorithms.

The following is a brief review of the classes of algorithms, categorized according to their communication
requirements, as presented in [1].

(1) Local Fixed - In these algorithms, the output depends on a small neighborhood of input data in which the
neighborhood size is normally fixed.

(2) Local Varying - Like the local fixed algorithms, the output at each point depends on a small
neighborhood of input data. However, the neighborhood size is an input parameter and is independent of the
input image size.

(3) Global Fixed - In such algorithms each output point depends on the entire input image. However, the
computation is normally input data independent.

(4) Global Varying - Unlike global fixed algorithms, in these algorithms the amount of computation and
communication depends on the image input as well as its size. That is, the output may depend on the entire
image or may depend on a part of the image.

2.1. 2-D Convolution 

2-D convolution is a Local Varying type of algorithm. A 2-D convolution of an NxN image I(i,j),
0 <= i,j <= N-1, with a kernel W(i,j), 0 <= i,j <= w-1, can be expressed as follows:

$$G(i,j) = \sum_{m=j-w/2}^{j+w/2} \; \sum_{n=i-w/2}^{i+w/2} I(n,m)\, W\big((i+w/2-n) \bmod w,\; (j+w/2-m) \bmod w\big)$$

In other words, each point in the output is replaced by a weighted sum over a wxw window around it.

The convolution algorithm illustrates how to map SIMD and systolic algorithms onto a processor cluster
when the number of processors is much smaller than the problem size. The approach is to transform the 2-D
convolution into a 1-D convolution by unfolding the window in one dimension, without incurring additional
steps for the unfolding. In other words, the amount of computation in the corresponding 1-D convolution
remains the same as that in the 2-D convolution. Figure 1 shows a cluster of 64 processors. The
interconnection between processors shows all the connections required to perform the convolution operation.
However, not all the connections are needed at the same time. We shall observe that one input and one
output connection are sufficient at any time, and that the flexibility of the crossbar can be used to obtain
all the desired interconnections efficiently.

Each pixel is logically mapped onto a separate processor (as if there were as many processors available as
there are pixels). In practice the image is folded in two dimensions like a torus, and multiple pixels are
mapped onto one processor. For a cluster of size P (assume P = pxp), each processor has M = N^2/P pixels in
its local memory. In general, pixel (i,j), 0 <= i <= N-1, 0 <= j <= N-1, is mapped to processor
((i mod p), (j mod p)). Therefore, this mapping preserves the adjacency of any two pixels even though the
image is folded.
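The folding can be checked with a short sketch. The function names and the sizes (a 16x16 image on a 4x4 cluster) are illustrative assumptions, not values from the paper, and the sketch assumes that p divides N.

```python
def owner(i, j, p):
    """Processor (row, col) of the p x p cluster that holds pixel (i, j)."""
    return (i % p, j % p)


def pixel_counts(n, p):
    """Number of pixels of an n x n image held by each processor."""
    counts = {}
    for i in range(n):
        for j in range(n):
            key = owner(i, j, p)
            counts[key] = counts.get(key, 0) + 1
    return counts


def east_neighbor(proc, p):
    """East neighbor of a processor on the p x p torus."""
    return (proc[0], (proc[1] + 1) % p)


n, p = 16, 4                      # illustrative sizes; p must divide n
counts = pixel_counts(n, p)

# Every processor holds M = N^2 / P pixels.
balanced = set(counts.values()) == {n * n // (p * p)}

# The pixel east of (i, j) on the image torus always lives on the
# east neighbor of the processor that holds (i, j).
adjacent_ok = all(
    owner(i, (j + 1) % n, p) == east_neighbor(owner(i, j, p), p)
    for i in range(n)
    for j in range(n)
)
print(balanced, adjacent_ok)
```

The same check can be repeated for the other three directions; it is what allows a single crossbar setting to serve every pixel a processor holds.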

Figure 1 shows the flow of the distribution of data for a window size of 5x5. A small window is embedded in
a larger one, and therefore the same connections can be used for a larger window size with the addition of
new connections for the extra steps. In other words, all the computation and communication needed for a
small window can be reused for a larger window. For example, a 5x5 window requires all the connections that
are required by a 3x3 window. The algorithm performs the convolution by having each processor distribute its
pixel values to the neighborhood in a pipelined manner.

In the following algorithm the North, South, East, and West neighbors are defined treating the image as a
torus. For a processor P(i,j), the N, S, E, W neighbors are defined as follows.

N = ((i-1), j); if (i-1) < 0, then N = ((i-1+p), j)

S = ((i+1) mod p, j)

E = (i, (j+1) mod p)

W = (i, (j-1)); if (j-1) < 0, then W = (i, (j-1+p))



Figure 1 : Mapping on the Cluster for Convolution (5x5 window)


At any step in the execution of the algorithm, all the processors have the same neighbor connection. For
example, if at a given instant processor P_i is connected to P_j such that P_j is the North neighbor of P_i,
then all other processors will also be connected to their North neighbors. Figure 1 shows how the values of
processor (3,3) are distributed; all the processors follow the same pattern. Note that the above definition
of neighbors is for logical neighbors, because it uses pixel adjacency rather than processor adjacency. It
does not imply any physical connections between processors, because the connections are programmed according
to pixel adjacency.

The algorithm works as follows (Figure 2). The DSP broadcasts the convolution weights to all the processors.
Each processor multiplies its M pixels with the central weight value. In Figure 2 the data values at each
processor are stored in a linear array, and subscript (i,j) means data value i in connection number j. The
intermediate values are stored in a running variable for each of the M pixels. The image is then shifted in
a spiral manner (as shown in Figure 1). If the image is shifted north, the processors multiply the pixel
values with the south weight. This process is repeated w^2 - 1 times, i.e., once for each remaining weight.
The following properties characterize the above algorithm. First, the mapping is independent of problem or
cluster size. That is, the mapping will work for all problem sizes. Second, the number of times the
interconnection needs to be changed depends only on the convolution kernel size. Furthermore, at any time
only one input and one output connection are required. By storing the connection patterns in the crossbar
memory, the switching time becomes negligible. Third, it is possible to overlap computation and
communication by writing a pixel to the output port as soon as it has been multiplied by the appropriate
weight in the current processor. The above algorithm illustrates that SIMD algorithms can be mapped
efficiently onto the processor clusters using the flexibility and programmability of the interconnection.


ALGORITHM CONVOLUTION

All the processors work in SIMD lock-step fashion.
DSP broadcasts the convolution kernel.
Set up Connection_array of size wxw - 1 in the crossbar memory by choosing
the first wxw - 1 connections from the set
    (N, E, S, S, W, W, N, N, N, E, E, E, S, S, S, S, W, W, W, W, ...).
M := N^2/P

/* w_0 is the central weight */
For i = 1 to M do (in parallel)
    Result(i) := w_0 * data(i)
End_For

For j = 1 to wxw - 1 do (in parallel)
    Set up appropriate connections on the crossbar as follows.
    connection(j) := connection_array(j)
    For i = 1 to M do (in parallel)
        Send data (pixels) on the output port to the connected neighbor.
        At the same time receive data from its input port.
        Result(i) := Result(i) + w_j * data(i,j)
    End_For
End_For

END CONVOLUTION

Figure 2 : 2-D Convolution
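The shift-and-accumulate scheme of Figure 2 can be simulated on a single machine as a sanity check. The sketch below is not the NETRA implementation: it shifts the whole image as a torus instead of modelling individual processors, and the spiral order (N, E, S, S, W, W, ...), the function names, and the test sizes are assumptions made for illustration. It compares the spiral result with a direct evaluation of the same wrap-around sum.

```python
def spiral_moves(w):
    """Unit moves of a square spiral that visits every cell of a w x w window (w odd)."""
    directions = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # N, E, S, W
    moves, run, d = [], 1, 0
    while len(moves) < w * w - 1:
        for _ in range(2):                 # run lengths go 1, 1, 2, 2, 3, 3, ...
            moves.extend([directions[d % 4]] * run)
            d += 1
        run += 1
    return moves[: w * w - 1]


def shift(img, mi, mj):
    """One nearest-neighbor transfer: every position receives the value at offset (mi, mj)."""
    n = len(img)
    return [[img[(i + mi) % n][(j + mj) % n] for j in range(n)] for i in range(n)]


def spiral_convolve(image, kernel):
    """Central weight first, then w*w - 1 single-step shifts, one weight per shift."""
    n, w = len(image), len(kernel)
    h = w // 2
    result = [[kernel[h][h] * image[i][j] for j in range(n)] for i in range(n)]
    shifted, di, dj = image, 0, 0
    for mi, mj in spiral_moves(w):
        shifted = shift(shifted, mi, mj)
        di, dj = di + mi, dj + mj          # cumulative offset of the data now held
        weight = kernel[h - di][h - dj]    # data from the north pairs with the south weight
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * shifted[i][j]
    return result


def direct_convolve(image, kernel):
    """Reference: the same wrap-around sum evaluated window by window."""
    n, w = len(image), len(kernel)
    h = w // 2
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for a in range(-h, h + 1):
                for b in range(-h, h + 1):
                    out[i][j] += image[(i + a) % n][(j + b) % n] * kernel[h - a][h - b]
    return out


image = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]
kernel3 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(spiral_convolve(image, kernel3) == direct_convolve(image, kernel3))
```

Because the moves for a 3x3 window are a prefix of the moves for a 5x5 window, the sketch also reflects the observation that a larger window reuses the connections of a smaller one.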
The computation time decreases as the number of processors increases. The communication time per pixel
depends only on the kernel size. The following formulae present the computation and communication times in
terms of multiplication and addition operations. The factor t_fl denotes the floating point speed of a
processor in terms of its normal instruction execution speed.

$$t_{cp} = 2 \times t_{fl} \times \frac{N^2}{P} \times w^2$$

$$t_{comm} = \frac{N^2}{P} \times w^2$$

$$t_{sw} = w^2 - 1$$

$$t_{tot} = \mathrm{Max}(t_{cp},\, t_{comm})$$


Figure 3 shows the performance of the 2-D convolution on a processor cluster. The processing time has been
computed assuming a 2 MFLOP processor. The figure shows two speedup graphs, one with communication overlap
and the other with additive communication. The computation time decreases linearly as the number of
processors increases. The total communication time per processor also decreases linearly, but the
communication time per pixel remains constant. An important observation is that communication and
computation must overlap in order to obtain linear speedups. However, if the interconnection speed is not
matched with the computation speed, then overlap will not be possible. Having a fast crossbar without
arbitration delays provides the necessary communication speed to obtain linear speedups. Note that since
computation and communication can overlap, this mapping is also applicable to systolic algorithms.

Figure 3 : Performance of 2-D Convolution on a Processor Cluster
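The difference between overlapped and additive communication can be illustrated with a small cost model. This is a sketch under stated assumptions rather than the model behind Figure 3: it assumes a computation time of 2 x t_fl x (N^2/P) x w^2, a communication time proportional to (N^2/P) x w^2, 0.5 microseconds per floating point operation (a 2 MFLOP processor), and 0.05 microseconds per one-byte pixel (20 Mbytes/sec). The function and parameter names are hypothetical.

```python
def convolution_times(n, processors, w, t_fl, t_pixel):
    """Per-processor computation and communication time (seconds) for a w x w kernel."""
    pixels = n * n / processors            # M = N^2 / P pixels per processor
    t_cp = 2 * t_fl * pixels * w * w       # one multiply and one add per weight
    t_comm = t_pixel * pixels * w * w      # one pixel transfer per weight (assumed)
    return t_cp, t_comm


def speedups(n, processors, w, t_fl, t_pixel):
    """Speedup over one processor with overlapped and with additive communication."""
    t_one = 2 * t_fl * n * n * w * w       # single processor, no communication
    t_cp, t_comm = convolution_times(n, processors, w, t_fl, t_pixel)
    overlapped = t_one / max(t_cp, t_comm)
    additive = t_one / (t_cp + t_comm)
    return overlapped, additive


# Assumed values: 512x512 image, 5x5 kernel, 2 MFLOPS, 20 Mbytes/sec, 1 byte per pixel.
for processors in (8, 16, 32, 64, 128):
    overlapped, additive = speedups(512, processors, 5, 0.5e-6, 0.05e-6)
    print(processors, round(overlapped, 2), round(additive, 2))
```

In this model the overlapped speedup equals P as long as the communication time does not exceed the computation time, while the additive speedup falls short of P by a constant factor.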

2.2. Separable Convolution

A two dimensional convolution is separable if it can be replaced by two one dimensional convolutions. The
main advantage of separability is that the computational requirements per pixel are reduced from 2w^2 to 4w.
We illustrate the mapping by giving an MIMD algorithm to run on a cluster.

The data is partitioned among the processors as follows. Each processor is assigned N/P rows of the data.
Processor P_i gets rows (i-1)xN/P to ixN/P - 1. Each processor computes the convolution along the rows using
a window of size w. Once processor P_i finishes the convolution along the rows, it needs rows
(i-1)xN/P - w/2 to (i-1)xN/P - 1 from processor P_(i-1), and similarly it needs the bottom w/2 rows,
ixN/P to ixN/P + w/2 - 1, from processor P_(i+1). Therefore, a processor needs to communicate with only two
processors to obtain the desired intermediate data. The boundary processors P_0 and P_(P-1) only need to
communicate with one other processor. Note that if the granule size of each processor is less than w/2
(i.e., N/P < w/2), then each processor needs to exchange data with more than two processors. Each processor
then computes the convolution along the columns in its granule. The following are the computational and
communication requirements of the algorithm.

$$t_{cp} = \frac{t_{fl} \times N^2 \times 4 \times (w/2+1)}{P}$$

$$t_{comm} = 2 \times N \times w$$



The amount of computation per pixel in separable convolution is a function of w for a wxw kernel, unlike
2-D convolution, where it is a function of w^2. The amount of communication in separable convolution is
fixed, as shown in Figure 4. Therefore, the speedup is not as large as in the case of 2-D convolution. There
are two reasons for the smaller speedup. First, the communication does not decrease with an increasing
number of processors, because each processor needs to exchange w/2 rows of intermediate results with two
adjacent processors. Second, since the computation per pixel itself is small, the communication overhead as
a fraction of computation time is large.

Figure 4 : Performance of Separable Convolution on a Processor Cluster
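The boundary-row exchange can be sketched as follows. For simplicity the sketch numbers processors from 0 and assumes that P divides N and that N/P >= w/2; the function names are hypothetical. It lists which rows a processor must obtain from each neighbor and shows that the amount received by an interior processor does not shrink as processors are added.

```python
def halo_rows(i, n, processors, w):
    """Rows that processor i (numbered from 0) needs from its neighbors after the row pass."""
    rows, half = n // processors, w // 2
    first, last = i * rows, (i + 1) * rows - 1
    needs = {}
    if i > 0:                                  # upper neighbor supplies w/2 rows
        needs[i - 1] = list(range(first - half, first))
    if i < processors - 1:                     # lower neighbor supplies w/2 rows
        needs[i + 1] = list(range(last + 1, last + 1 + half))
    return needs


def received_values(i, n, processors, w):
    """Number of intermediate values processor i receives (each row has n values)."""
    return n * sum(len(rows) for rows in halo_rows(i, n, processors, w).values())


n, w = 16, 4                                   # illustrative sizes
print(halo_rows(1, n, 4, w))                   # interior processor: two neighbors
print(halo_rows(0, n, 4, w))                   # boundary processor: one neighbor
print(received_values(1, n, 4, w), received_values(1, n, 8, w))
```

An interior processor receives N x w values regardless of P, which is why the communication time stays fixed while the computation time shrinks.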

2.3. 2-D FFT 

2-D FFT is a Global Fixed algorithm. For an image I(k,l), 0 <= k,l <= N-1, the corresponding 2-D FFT is
given by

$$F(m,n) = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} I(k,l)\, e^{-2\pi j (km+ln)/N}, \qquad 0 \le m,n \le N-1$$

where j = sqrt(-1). A useful property of the 2-D FFT is that it can be computed in two steps: a one
dimensional N point FFT along the rows, followed by a one dimensional N point FFT along the columns of the
row FFT values, or vice versa. We use this property to map the 2-D FFT onto the cluster processors. The
algorithm consists of three phases: 1-D FFT computation along the rows, transposition of the intermediate
results, and 1-D FFT computation along the columns.
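The row-column property can be verified numerically. The sketch below uses a plain O(N^2) DFT in place of an FFT to stay short, omits the normalization factor in front of the sums, and uses hypothetical function names; it follows the three phases (rows, transpose, columns) and compares the result with a direct evaluation of the double sum.

```python
import cmath


def dft_1d(x):
    """Plain N-point DFT of a sequence (stands in for a 1-D FFT)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dft_2d_row_column(img):
    """Phase 1: transform rows. Phase 2: transpose. Phase 3: transform the former columns."""
    rows = [dft_1d(r) for r in img]
    cols = [dft_1d(c) for c in transpose(rows)]
    return transpose(cols)             # transpose back so that result[m][n] = F(m, n)


def dft_2d_direct(img):
    """Direct evaluation of the double sum, without the normalization factor."""
    n = len(img)
    return [
        [
            sum(
                img[k][l] * cmath.exp(-2j * cmath.pi * (k * m + l * q) / n)
                for k in range(n)
                for l in range(n)
            )
            for q in range(n)
        ]
        for m in range(n)
    ]


img = [[(i * 4 + j) % 5 for j in range(4)] for i in range(4)]
a, b = dft_2d_row_column(img), dft_2d_direct(img)
max_error = max(abs(a[m][q] - b[m][q]) for m in range(4) for q in range(4))
print(max_error < 1e-9)
```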

Figure 5 describes the algorithm. In the first phase each processor is assigned N/P rows. Let us denote the
sequence of rows held by processor P_i as Granule(i). Also, let us divide each granule into P equal blocks,
each N/P columns wide, as shown in Figure 6. B(i,j), 0 <= j <= P-1, denotes a block of size N^2/P^2 held by
processor P_i. Each processor computes the 1-D FFT along the rows of its granule. Then, in the second phase,
the processors communicate with each other in the following manner to transpose the intermediate results.
Processor P_i sends block B(i,j) to processor P_j for all 0 <= j <= P-1, j ≠ i. Each processor needs to
exchange a block with every other processor in the cluster. However, by performing the communication
systematically, the transpose can be achieved without any conflicts, as described in the algorithm. Finally,
each processor computes the 1-D FFT along the columns.


ALGORITHM 2-D FFT

Each processor P_i receives granule(i) of rows.

/* The following description is with respect to processor P_i */
M := N/P

For k = 1 to M do
    compute 1-D FFT of row(k) of granule(i)

For j = 1 to P-1 do
    k := (i+j) mod P
    connect P_i to P_k
    send Block(i,k) to P_k
    receive Block(l,i) from P_l, where l = (i-j) mod P

For k = 1 to M do
    compute 1-D FFT of row(k) of granule(i)

END 2-D FFT


Figure 5 : 2-D FFT
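One way to see that the block exchange can be conflict free is to treat step j as a cyclic shift by j: every processor sends to the processor j positions ahead of it, so each step is a permutation with exactly one sender and one receiver per processor. The sketch below is an illustration under that assumption (processors numbered from 0, blocks represented by labels, function names hypothetical); in a real transpose each block would also be transposed locally.

```python
def transpose_schedule(processors):
    """Step j (1 <= j <= P-1): processor i sends to processor (i + j) mod P."""
    return [
        [(i, (i + j) % processors) for i in range(processors)]
        for j in range(1, processors)
    ]


def is_conflict_free(step):
    """Each processor appears once as a sender and once as a receiver."""
    senders = {src for src, _ in step}
    receivers = {dst for _, dst in step}
    return len(senders) == len(step) and len(receivers) == len(step)


def exchange_blocks(blocks):
    """blocks[i][k] is B(i, k) held by P_i; afterwards P_i holds B(k, i) in slot k."""
    processors = len(blocks)
    out = [[None] * processors for _ in range(processors)]
    for i in range(processors):
        out[i][i] = blocks[i][i]                 # the diagonal block never moves
    for step in transpose_schedule(processors):
        for src, dst in step:
            out[dst][src] = blocks[src][dst]     # P_src sends B(src, dst) to P_dst
    return out


P = 4
blocks = [[(i, k) for k in range(P)] for i in range(P)]
schedule = transpose_schedule(P)
print(len(schedule), all(is_conflict_free(step) for step in schedule))
print(exchange_blocks(blocks)[1])
```

The schedule has P - 1 steps, one crossbar setting per step.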
The 1-D FFT of a row of N pixels can be computed in O(N log N) time [4]. The constant of multiplication is
6, i.e., an N point 1-D FFT takes approximately 6N log N floating point operations. Therefore, the
computation time for the above algorithm (for both row and column steps) is

$$t_{cp} = \frac{12 \times t_{fl} \times N \times N \times \log_2 N}{P}$$

The communication time to transpose the intermediate results is

$$t_{comm} = 2 \times (P-1) \times \frac{N^2}{P^2}$$

and the number of switch settings is t_sw = P - 1.

Even though FFT is a Global Fixed algorithm, in the above mapping both the computation and communication
times decrease as the number of processors increases. In other words, both computation and communication are
decomposable for parallel processing. Therefore, if the communication is achieved without conflicts (as in
our case), we can obtain linear speedups.

Figures 7 and 8 show the performance of the 2-D FFT algorithm on a processor cluster. From Figure 7 we can
observe that almost linear speedup can be obtained. The variation of the communication time as a function of
the number of processors is shown in Figure 8. Note that the communication time curve follows the
computation time curve in its shape, and that the communication is completely decomposable.

Figure 7 : Performance of 2-D FFT on a Processor Cluster

Figure 8 : Communication Time for 2-D FFT on a Processor Cluster

2.4. Hough Transform

The Hough transform is used to detect curves (such as lines, circles, and ellipses) in an input image [5].
The computation is performed in the parameter space of the curve. Consider the example of detecting line
segments, where the computation is performed in the (r, θ) parameter space. If there exists a line whose
normal distance from the origin is r and whose normal makes an angle θ with the x-axis, then every point
(x, y) that lies on that line satisfies the following equation.

$$r = x\cos\theta + y\sin\theta$$

First, r and θ are quantized. The quantization depends on how much accuracy is required in the final result.
Let the maximum value of r be r_max and the maximum value of θ be θ_max (generally π). Then, if r_res and
θ_res are the resolutions used for quantization, the total number of accumulator cells in the computation is
(r_max x θ_max)/(r_res x θ_res), the number of rows and columns in the accumulator array being
θ_c = θ_max/θ_res and ρ_c = r_max/r_res, respectively. The algorithm involves two major steps. The first
step is to accumulate votes in the accumulator array for various digitized r and θ values. The second step
is to compute local maxima in the output of the first step. The first step is regular and suitable for SIMD
implementation. The second step is more suitable for MIMD implementation because the output is global data
dependent. For example, an image containing many lines will result in many more maxima than an image
containing a few lines, and therefore the required computation will vary.

The Hough transform is a Global Varying algorithm. Furthermore, the communication cannot be decomposed. We
present two mappings of the Hough transform onto the processor cluster. The first mapping divides the input
image into as many granules as the number of available processors. The second mapping divides the parameters
among the processors. The former is referred to as "data partitioning" and the latter as "parameter
partitioning." We discuss the advantages and disadvantages of both mappings and also compare their
computation time, communication time, and memory requirements.


Data Partitioning 


Assume that the input image is NxN and, to simplify the discussion, that the number of available processors
is P = p^2. The image is partitioned into p^2 blocks of N^2/p^2 pixels each. Processor P(i,j) works on block
i*p + j, where 1 <= i,j <= p. Each processor computes the vote count for its part of the image for all
quantizations of θ. Figure 9 shows the accumulator array for a processor. Note that each processor has to
maintain a complete accumulator array of size ρ_c x θ_c, and update the appropriate vote counts computed
from its share of the image. The algorithm ACCUMULATE_COUNT in Figure 10 shows the computation for this
step. The time to compute the accumulator array is the time taken to perform 2 x (N^2/p^2) x f x θ_c
multiplications and half as many additions, where f is the largest fraction of significant pixels in a block
and θ_c is the number of quantizations of θ. The next step is to combine the partial results of all the
processors to obtain a global accumulator array so that maxima can be determined. For combining the partial
results we propose the tree sum method, in which each step doubles the number of processors whose partial
results have been combined, therefore requiring 2 x log_2 p steps.



Figure 9 : Accumulator array for Hough Transform (Accum_array(i,j) for one processor)


ALGORITHM ACCUMULATE_COUNT

Each processor P_i, 1 <= i <= p^2, does the following (in parallel)

For j = 1 to θ_c do
    For each (x,y) in the subimage such that (x,y) is significant do
        /* significant means black pixel or edge element */
        compute r(θ_j) = x cos θ_j + y sin θ_j
        Accum_array(θ_j, r(θ_j)/r_res) := Accum_array(θ_j, r(θ_j)/r_res) + 1
    End_For
End_For

END ACCUMULATE_COUNT

Figure 10 : Algorithm to Compute Votes in Hough Transform
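A minimal single-processor sketch of the vote accumulation is given below. It assumes that θ is quantized uniformly over [0, π), that votes with a negative r are simply discarded, and that the significant pixels are supplied as a list of coordinates; the function name and parameters are hypothetical.

```python
import math


def accumulate_votes(points, theta_count, r_max, r_res):
    """Accumulator with theta_count rows (one per angle) and r_max / r_res columns."""
    rho_count = int(math.ceil(r_max / r_res))
    acc = [[0] * rho_count for _ in range(theta_count)]
    for t in range(theta_count):
        theta = math.pi * t / theta_count
        c, s = math.cos(theta), math.sin(theta)
        for x, y in points:                    # only significant pixels are listed
            r = x * c + y * s
            cell = int(r // r_res)
            if 0 <= cell < rho_count:          # simplification: negative r is dropped
                acc[t][cell] += 1
    return acc


# Ten significant pixels on the vertical line x = 5.
points = [(5, y) for y in range(10)]
acc = accumulate_votes(points, theta_count=180, r_max=20.0, r_res=1.0)
print(acc[0][5])                               # votes for theta = 0, r = 5
```

All ten pixels vote for the cell with θ = 0 and r = 5, which is the line they lie on.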


The algorithm ACCUMULATE_SUM in Figure 11 performs the merging of partial results. The processors are
numbered from 0 to p^2 - 1. A processor with number k, 0 <= k <= p^2 - 1, corresponds to processor (i,j)
such that k = i*p + j.


/* Accum_array_k(i,j) denotes the accumulator cell (i,j) of
   the accumulator array of processor k. */

ALGORITHM ACCUMULATE_SUM

For i = 0 to 2 x log_2 p - 1 do
    For all processors P_j do in parallel (0 <= j <= p^2 - 1)
        If j mod 2^(i+1) = 2^i then
            Connect P_j --> P_(j - 2^i)
            For k = 1 to θ_c do
                For l = 1 to ρ_c do
                    Send Accum_array_j(k,l) to P_(j - 2^i)
                    Accum_array_(j - 2^i)(k,l) := Accum_array_(j - 2^i)(k,l)
                                                  + Accum_array_j(k,l)
                End_For
            End_For
        End_If
    End_For
End_For

END ACCUMULATE_SUM

Figure 11 : Algorithm to Accumulate the Vote Count


Following this step, processor P_0 has the entire accumulator sum. The next step is to distribute this
global accumulator sum to all the processors so that the computation of local maxima can be performed in
parallel. This needs only one step: processor P_0 broadcasts the entire array to all the processors using
the broadcast facility of the crossbar. After the broadcast step, each processor performs a search for local
maxima on its share of the accumulator rows (θ_c/p^2). In this algorithm, for each entry in its block of the
accumulator array, the processor determines whether the entry represents a local maximum.
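The tree-sum merge can be sketched with flat lists standing in for the accumulator arrays. The sketch assumes that the number of processors is a power of two and that, at the step with stride 2^i, processor j with j mod 2^(i+1) = 2^i adds its array into that of processor j - 2^i; the function name is hypothetical.

```python
def tree_sum(partials):
    """Merge per-processor accumulators pairwise; returns (total at P_0, number of steps)."""
    arrays = [list(a) for a in partials]
    processors = len(arrays)                  # assumed to be a power of two
    stride, steps = 1, 0
    while stride < processors:
        for j in range(processors):
            if j % (2 * stride) == stride:    # sender at this level of the tree
                dst = j - stride
                arrays[dst] = [a + b for a, b in zip(arrays[dst], arrays[j])]
        stride *= 2
        steps += 1
    return arrays[0], steps


# Eight processors, each holding a two-cell partial accumulator.
partials = [[j, 2 * j] for j in range(8)]
total, steps = tree_sum(partials)
print(total, steps)
```

With P = p^2 processors the loop runs log_2 P = 2 x log_2 p times, and every step moves a complete accumulator array, which is why both the second term of the computation time and the communication time grow with log P.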

In summary, the total computation and communication time requirements for the entire Hough transform
algorithm using data partitioning are as follows.

$$t_{cp} = 3 \times t_{fl} \times \frac{N^2}{p^2} \times f \times \theta_c + \theta_c \times \rho_c \times \log P + \theta_c \times \rho_c \times \frac{w^2}{p^2}$$

where the first term is for computing the votes, the second term is for summing the accumulator arrays, and
the third term is for looking for local maxima in a window of size w^2. The communication time for this
algorithm is

$$t_{comm} = (\log P + 1) \times \theta_c \times \rho_c$$

and the number of switch settings is t_sw = log P + 1.


Unlike the 2-D FFT, the communication is not decomposable. In other words, the communication increases as
the number of processors in a cluster increases. Figure 12 shows the computation and communication times
along with the speedup for the Hough transform. Even though the computation time for the Hough transform
decreases as the number of processors increases, the computation is not completely decomposable. The second
term of t_cp (to combine partial results) increases as a log function of the number of processors.
Furthermore, the communication overhead to combine accumulator arrays also increases logarithmically with
the number of processors. Consequently, for a large number of processors, the communication time becomes
comparable to the computation time (as shown in Figure 12), which degrades the speedup.

Figure 12 : Performance of Hough Transform (Data Partitioning)


Parameter Partitioning

In this mapping, instead of partitioning the data among the processors, the parameter space is partitioned.
Each processor works on the entire image but computes the vote count for only a few θ values. Each processor
computes all ρ values for its share of θ values. If there are p^2 processors, then each processor works on
n = θ_c/p^2 values of θ. There are several advantages to this mapping, both in terms of communication and
implementation at each processor. First, when looking for local maxima later, a processor needs to
communicate with only two other processors to obtain the upper and lower boundary rows of the Accumulator
array. Second, we introduce additional data structures to make the search for local maxima efficient:
instead of searching the entire accumulator array, only a fraction indicating possible local maxima needs to
be searched. Furthermore, the processor can store the sin θ and cos θ values for its allocated n values of θ
in its registers, since only a few values need to be stored. This saves the local memory access delays that
would occur if all quantized sin θ and cos θ values were stored by each processor in its local memory. The
algorithm to compute the accumulator array at each processor is similar to that in the case of data
partitioning, except that each processor works on the entire image but only on its own part of the
parameters.

A brief explanation of the algorithm is as follows. In the first step (computing votes), the algorithm
computes the value of ρ for each significant pixel for all θ values. It then increments the appropriate
count in the Accumulator array. If the count increases beyond a certain threshold value, this entry may be a
local maximum. Therefore, another array called the Link_array is updated to mark this fact. This step
reduces the search space when looking for local maxima, since normally a very small fraction of the image
contributes to lines and the entire Accumulator array need not be searched. Once the above computation is
finished for the entire image, processor P_i communicates with P_(i+1) and P_(i-1) to obtain the boundary
rows of the Accumulator array. Then the local maxima are computed in the Accumulator array using the
information available in the Link_array: only those entries of the Accumulator array that are marked in the
Link_array need to be searched. The computation, communication, and memory requirements for this mapping are
as follows.


$$t_{cp} = 3 \times t_{fl} \times \frac{N^2}{p^2} \times f \times \theta_c + \theta_c \times \rho_c \times \frac{w^2}{p^2}$$

where the first term is for computing the votes and the second term is for looking for local maxima in a
window of size w^2. The communication time for this algorithm is

$$t_{comm} = 2 \times \rho_c$$

and the number of switch settings is t_sw = 2.

The memory requirements of the two partitionings are comparable. For example, for an image size of 512x512,
the value of ρ_c will typically be 512 x sqrt(2), and θ_c will be 180. However, each pixel is normally a
byte, whereas each accumulator cell is an integer. Assuming a 4 byte integer, in data partitioning a
processor has to store the entire accumulator array of size 521 Kbytes (approximately), and in the second
mapping a processor has to store the entire image (256 Kbytes) and its part of the accumulator array.
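The figures quoted above can be reproduced with a few lines of arithmetic. The rounding of 512 x sqrt(2) to an integer and the 16-processor example are assumptions made for illustration; the variable and function names are hypothetical.

```python
import math

n = 512                                   # image is n x n, one byte per pixel
theta_c = 180                             # number of quantized angles
rho_c = round(n * math.sqrt(2))           # 724 quantized distances
bytes_per_cell = 4                        # 4 byte integer per accumulator cell

accumulator_bytes = theta_c * rho_c * bytes_per_cell   # full array (data partitioning)
image_bytes = n * n                                     # full image (parameter partitioning)


def parameter_partition_bytes(processors):
    """Whole image plus this processor's share of the accumulator rows."""
    return image_bytes + accumulator_bytes // processors


print(rho_c, accumulator_bytes, image_bytes)
print(parameter_partition_bytes(16))
```

The full accumulator comes to 521,280 bytes, i.e. about 521 Kbytes in decimal units, and the image to 262,144 bytes, i.e. 256 Kbytes in binary units.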


Figure 13 : Performance of Hough Transform (Parameter Partitioning)

Another way in which the parameter partitioning mapping can be performed is as follows. Instead of storing
the image in all the processors, a controller processor, such as a DSP, can store the image and broadcast
each significant pixel value and its location while the processors compute the votes in an SIMD lock-step
fashion. This saves memory, because now only one processor needs to store the image. The communication
requirement for this mapping is f x N^2, where f is the fraction of significant pixels. However, the
communication can be overlapped with computation, because while the processors are computing the vote count
for one location in the image, the next location can be broadcast. Therefore, the time to compute the
Accumulator_array in this case will be MAX(t_cp, broadcast time for f x N^2 pixel locations).

By using parameter partitioning, the overhead of combining partial results is eliminated, and for each
processor the communication is reduced to exchanging one row of the accumulator array with two other
processors. Therefore, the communication remains constant as the number of processors increases. Figure 13
shows the speedup, computation time, and communication time for the Hough transform using parameter
partitioning. Figure 14 compares the communication overhead and the speedup for the two types of
partitioning. Notice that using parameter partitioning it is possible to obtain almost linear speedup.
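Parameter partitioning can be sketched by giving each simulated processor a contiguous slice of θ values and the whole list of significant pixels. The sketch assumes that the number of processors divides θ_c, quantizes θ uniformly over [0, π), and drops votes with negative r; all names are hypothetical. Because every accumulator row is produced by exactly one processor, no merging step is needed.

```python
import math


def votes_for_thetas(points, theta_indices, theta_count, rho_count, r_res):
    """Accumulator rows for the given theta indices, computed over the entire image."""
    rows = {}
    for t in theta_indices:
        theta = math.pi * t / theta_count
        c, s = math.cos(theta), math.sin(theta)   # few values, could stay in registers
        row = [0] * rho_count
        for x, y in points:
            cell = int((x * c + y * s) // r_res)
            if 0 <= cell < rho_count:             # simplification: negative r is dropped
                row[cell] += 1
        rows[t] = row
    return rows


def parameter_partition(points, processors, theta_count, rho_count, r_res):
    """Each processor handles theta_count / processors consecutive theta values."""
    share = theta_count // processors
    rows = {}
    for proc in range(processors):
        mine = range(proc * share, (proc + 1) * share)
        rows.update(votes_for_thetas(points, mine, theta_count, rho_count, r_res))
    return [rows[t] for t in range(theta_count)]


points = [(x, 2 * x + 1) for x in range(8)] + [(3, 9), (7, 2)]
single = votes_for_thetas(points, range(180), 180, 30, 1.0)
partitioned = parameter_partition(points, 4, 180, 30, 1.0)
print(partitioned == [single[t] for t in range(180)])
```

Only the first and last rows of each slice are needed by the neighboring processors when searching for local maxima, which matches the constant communication cost noted above.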



Figure 14 : Comparison of Performance of Parameter Partitioning and Data Partitioning for Hough Transform

3. Parallel Implementation Results

This section describes the implementation of some algorithms on a simulated processor cluster. A cluster was
simulated on an Intel iPSC/2 hypercube multiprocessor. The performance results capture all the overheads
associated with parallel programming, and therefore the results are very accurate. Also, we show through the
example of the 2-D FFT algorithm that the analysis presented in the previous section is very close to the
implementation results. We present performance results for four algorithms in this section. Two of them are
the 2-D FFT and separable convolution. The other two, Sobel edge detection and median filtering, are part of
the Image Understanding Benchmark developed by Weems et al. [2]. The performance of the algorithms has been
evaluated using the test data provided with the benchmark algorithms [2].

3.1. 2-D FFT

A mapping of 2-D FFT has been described in Section 2. Figure 15 shows the performance of 2-D FFT on a
16-processor cluster (image size 256x256). Other parameters are the same as given in Table 1. Solid lines in the
graph show the computation times for analysis (symbol +) and implementation. We observe that the analytical
results are very accurate. However, the implementation times are slightly higher than those given by the analysis
because the implementation captures overheads such as index management, which are not included in the analysis.
The graph also shows the corresponding speedups for both cases. Note that the speedups obtained through analysis
and implementation are practically indistinguishable. Figure 16 shows graphs for the communication time. Again,
implementation and analytical results are very close to each other.

Figure 15 : Performance of 2-D FFT on a cluster (Analysis and Implementation)


3.2. Separable Convolution

Table 2 shows the performance of the separable convolution implementation on a 256x256 image with window
size 10x10. The table shows the major computation operations in the algorithm, which include floating point
operations as well as integer operations. The fifth column shows the number of times the connection in the crossbar
needs to be changed during the algorithm execution, and column 6 contains the rounded value of the amount of data
communicated in KBytes. The table shows that the communication time is very small compared to the computation
time, and therefore, good speedups are obtained.

Table 2 : Separable Convolution Implementation Results

Separable Convolution, Window 10x10

No.     ...     ...     Comp.        Comm.    Comm.     Comm.
Proc.                   Time (ms.)   Setup    KBytes    Time (ms.)
1       3932    3932    2607         0        0         0
2       1966    1966    1310         2        20        4.09
4       983     983     658          3        20        4.09
8       492     492     332          3        20        4.09
16      246     246     169          3        20        4.09
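The speedups implied by Table 2 can be recomputed directly from its entries. The short Python sketch below assumes that the "Comp. Time" column is the elapsed computation time in milliseconds and that communication time simply adds to it without overlap; both are reading assumptions, not statements from the paper.

```python
# Values copied from Table 2 (times in ms), keyed by number of processors.
comp_time = {1: 2607, 2: 1310, 4: 658, 8: 332, 16: 169}
comm_time = {1: 0.0, 2: 4.09, 4: 4.09, 8: 4.09, 16: 4.09}


def speedup(p):
    """Speedup on p processors, counting communication in the parallel time."""
    return comp_time[1] / (comp_time[p] + comm_time[p])


speedups = {p: round(speedup(p), 2) for p in comp_time}
efficiency = {p: round(speedup(p) / p, 3) for p in comp_time}
```

Under these assumptions the efficiency stays above 0.9 up to 16 processors, which is consistent with the remark that communication time is small compared to computation time.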

3.3. Benchmark Algorithms 

The Image Understanding Benchmark provided the serial version of the programs and the data [2]. We
implemented the Sobel edge detection and median filtering algorithms.

3.3.1. Sobel 

Sobel edge detection is a two-dimensional convolution operation with a 3x3 mask. The implementation used
medium grain parallelism in an MIMD mode, and the mapping was similar to that of separable convolution. Table 3
shows the performance results for the Sobel edge detection algorithm. There were six data sets, but here we present
results for only one (test, size 256x256). The results obtained on the other data sets were similar. The table
includes all overheads: program load time, data load time, data input time (from global memory), and the time
to gather results. If all the overhead is included, then the performance for larger cluster sizes is sublinear. There
are two main reasons for this. First, the amount of computation per pixel is very small (3x3 convolution), and
second, all the overhead is included in the computation of the speedup. The parameters for communication
bandwidth are conservative (20 MBytes/sec.); with a larger bandwidth, the performance is expected to be much
better.

Figure 16 : Communication Time for 2-D FFT


Table 3 : Sobel Edge Detection

Sobel (Test)

No.     Proc.         Data load     Result Output   Prog. Load    Data Input    Total         Speed up
Proc.   Time (sec.)   Time (sec.)   Time (sec.)     Time (sec.)   Time (sec.)   Time (sec.)
1       4.04          0             0               0             ...           4.05          1
2       2.02          0.056         0.014           0.001         ...           2.1           1.92
4       1.01          0.056         0.014           0.001         ...           1.09          3.70
8       0.51          0.056         0.014           0.001         ...           0.589         6.91
16      0.26          0.056         0.014           0.001         ...           0.33          12.13
32      0.13          0.056         0.014           0.001         ...           0.21          19.71
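The effect of the fixed overheads on the Sobel results can be seen by comparing a speedup based on processing time alone with the share of the total time that is overhead. The Python sketch below uses only the processing and total times from Table 3; the helper names are illustrative.

```python
# Values copied from Table 3 (times in seconds), keyed by number of processors.
proc_time = {1: 4.04, 2: 2.02, 4: 1.01, 8: 0.51, 16: 0.26, 32: 0.13}
total_time = {1: 4.05, 2: 2.1, 4: 1.09, 8: 0.589, 16: 0.33, 32: 0.21}


def compute_only_speedup(p):
    """Speedup if only the processing time is counted."""
    return proc_time[1] / proc_time[p]


def overhead_share(p):
    """Fraction of the total time on p processors that is not processing."""
    return 1 - proc_time[p] / total_time[p]


for p in proc_time:
    print(f"{p:2d}  {compute_only_speedup(p):6.2f}  {overhead_share(p):6.1%}")
```

The processing time alone scales almost linearly, while the overhead share grows with the number of processors because the load and output times stay fixed as the per-processor computation shrinks.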
3.3.2. Median Filtering

Table 4 shows the performance results for the median filtering algorithm. The algorithm was evaluated on
the same data set. The size of the median filter was 5x5. The data is partitioned along the rows. Each processor is
allocated an equal number of rows plus two boundary rows in each direction. There is no need for communication
during the algorithm execution. Median filtering does not involve any floating point multiplication or addition
operations (only comparison operations are needed). Table 4 shows that we can obtain good speedups on a cluster
for median filtering.
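The row partitioning with boundary rows can be sketched as follows. This Python example is an illustration under stated assumptions rather than the benchmark code: border pixels are handled by clamping coordinates (the paper does not say how borders are treated), the number of rows is assumed to divide evenly, and a loop stands in for the processors.

```python
from statistics import median


def median_filter_rows(img, row_start, row_end, k=5):
    """Median-filter rows [row_start, row_end) of img, clamping at the borders."""
    h, w, r = len(img), len(img[0]), k // 2
    out = []
    for y in range(row_start, row_end):
        row = []
        for x in range(w):
            window = [img[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]
                      for yy in range(y - r, y + r + 1)
                      for xx in range(x - r, x + r + 1)]
            row.append(median(window))
        out.append(row)
    return out


def partitioned_median(img, n_proc, k=5):
    """Give each worker its rows plus k // 2 boundary rows on each side."""
    h, r = len(img), k // 2
    rows_per = h // n_proc                   # assumes h is divisible by n_proc
    result = []
    for p in range(n_proc):                  # stands in for parallel processors
        start, end = p * rows_per, (p + 1) * rows_per
        lo, hi = max(start - r, 0), min(end + r, h)
        slab = img[lo:hi]                    # local copy: own rows plus halo
        result.extend(median_filter_rows(slab, start - lo, end - lo, k))
    return result


img = [[(7 * y + 3 * x * x) % 11 for x in range(6)] for y in range(8)]
whole = median_filter_rows(img, 0, len(img))
split = partitioned_median(img, 4)
```

Since every window a worker needs lies inside its local slab, the partitioned result matches the sequential one without any exchange of data during execution.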


4. Performance of Parallel Algorithms on Multiple Clusters 

The extent of inter-cluster communication depends on the type of algorithm, how it is mapped in parallel, the
frequency of communication and the amount of data to be communicated. As discussed in the first paper [1], these
requirements vary for algorithms belonging to different classes.

We are mainly interested in the performance evaluation of parallel algorithms when mapped across clusters. 
The performance of an algorithm will be affected by interference from other processors in the system which are not 
executing the particular algorithm under study. 


Table 4 : Median Filtering

Median Filtering (Test)

No.     Proc.         Data load     ...   Data Input    Total         Speed up
Proc.   Time (sec.)   Time (sec.)         Time (sec.)   Time (sec.)
1       ...           0             0     ...           60.37         1
2       ...           ...           ...   ...           30.30         1.99
4       15.19         ...           ...   ...           15.31         3.94
8       7.72          ...           ...   ...           7.85          7.70
16      3.99          ...           ...   ...           4.11          14.68
32      1.90          ...           ...   ...           2.02          29.93

Consider a parallel execution of an algorithm across clusters. Suppose the computation time is $t_{cp}$, the
intra-cluster communication time is $t_{cl}$, the inter-cluster communication time is $t_{icl}$, and the execution
time when the algorithm is executed on a single processor is $t_{seq}$. Then the speedup in the best case, that is,
assuming there is no interference while accessing the network or the global memory, is given by

$$S_p = \frac{t_{seq}}{t_{cp} + t_{cl} + t_{icl}} \qquad (1)$$

When there are conflicts while accessing the network, the inter-cluster communication time is given by
$w \times t_{icl}$, and therefore the speedup is given by

$$S_p' = \frac{t_{seq}}{t_{cp} + t_{cl} + w \times t_{icl}} \qquad (2)$$

Hence, the degradation in speedup with respect to the best-case speedup is

$$\frac{S_p - S_p'}{S_p} = \frac{(w - 1) \times t_{icl}}{t_{cp} + t_{cl} + w \times t_{icl}} \qquad (3)$$
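The three speedup relations can be checked numerically. The Python sketch below evaluates the best-case speedup, the speedup under conflicts, and the relative degradation for arbitrary illustrative values (the numbers are not taken from the paper), and confirms that the closed form for the degradation agrees with the direct difference.

```python
def best_case_speedup(t_seq, t_cp, t_cl, t_icl):
    """Speedup with no interference on the network or global memory."""
    return t_seq / (t_cp + t_cl + t_icl)


def conflict_speedup(t_seq, t_cp, t_cl, t_icl, w):
    """Speedup when conflicts stretch inter-cluster communication by a factor w."""
    return t_seq / (t_cp + t_cl + w * t_icl)


def degradation(t_cp, t_cl, t_icl, w):
    """Relative loss in speedup with respect to the best case (closed form)."""
    return (w - 1) * t_icl / (t_cp + t_cl + w * t_icl)


# Illustrative values only.
t_seq, t_cp, t_cl, t_icl, w = 1000.0, 100.0, 5.0, 20.0, 2.0
s_best = best_case_speedup(t_seq, t_cp, t_cl, t_icl)
s_conf = conflict_speedup(t_seq, t_cp, t_cl, t_icl, w)
direct = (s_best - s_conf) / s_best
closed = degradation(t_cp, t_cl, t_icl, w)
```

Note that the degradation does not depend on the sequential time: it is governed only by how large the stretched inter-cluster term is relative to the rest of the parallel time.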

This section discusses the performance of various algorithms when mapped across clusters. The algorithms
are selected according to their communication requirements. We have chosen one algorithm from each of the
following categories: Local Varying, Global Fixed and Global Varying. Algorithms in each of these categories
exhibit different communication characteristics, and therefore, the analysis will provide the performance of the
architecture for a wide range of algorithms.


4.1. Two-Dimensional Fast Fourier Transform (2-D FFT) 

From Section 2 we know that a 2-D FFT can be performed in two steps: a one-dimensional N-point FFT
along the rows followed by a one-dimensional N-point FFT along the columns, or vice versa. We use this property
to map the algorithm across clusters. Dividing the data along rows does not require communication when
computing the one-dimensional FFTs. However, communication is needed to obtain the transpose of the
intermediate results. Figure 17 shows an example of the two steps and the communication for three clusters.
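The row-column property can be illustrated with a small Python sketch. For brevity it uses a direct O(N^2) discrete Fourier transform per row instead of a fast algorithm, and all function names are illustrative; the point is only that row transforms, a transpose, and a second set of row transforms reproduce the 2-D transform.

```python
import cmath


def dft(v):
    """Direct 1-D discrete Fourier transform (stands in for an N-point FFT)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def fft2_row_column(img):
    rows_done = [dft(row) for row in img]          # step 1: no communication
    cols = transpose(rows_done)                    # communication step
    return transpose([dft(col) for col in cols])   # step 2: along the columns


def dft2_direct(img):
    """Direct 2-D transform of a square image, used only for comparison."""
    n = len(img)
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * y + v * x) / n)
                 for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]


img = [[(3 * y + x * x) % 7 for x in range(4)] for y in range(4)]
fast = fft2_row_column(img)
ref = dft2_direct(img)
max_err = max(abs(fast[u][v] - ref[u][v]) for u in range(4) for v in range(4))
```

Only the transpose touches data owned by other rows, which is why it is the single step that requires communication in the mapping.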
Figure 17 : An Example of Mapping 2-D FFT on Three Clusters
Clusters are allocated rows in proportion to their size. A cluster $C_i$ of size $P_c(i)$ (i.e., containing
$P_c(i)$ processors) is allocated

$$\frac{N \times P_c(i)}{\sum_{j=1}^{n} P_c(j)}$$

rows, where $n$ is the total number of clusters executing the algorithm. Within a cluster, rows are equally divided
among processors. In the first phase, processors compute the N-point FFT of all the rows in their granule. In the
second phase, to obtain the transpose of the intermediate data, processors write the intermediate results into
designated global memory locations, which are read by other processors. Data remaining within a cluster is
transposed using the cluster crossbar.
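A small Python sketch of the proportional row allocation follows. The handling of rows left over after integer division (handing them one at a time to the first clusters) is an assumption made here for completeness; the paper does not discuss the case where the division is not exact.

```python
def allocate_rows(n_rows, cluster_sizes):
    """Allocate rows to clusters in proportion to their processor counts."""
    total = sum(cluster_sizes)
    rows = [n_rows * size // total for size in cluster_sizes]
    # Assumed tie-break: distribute any remaining rows to the first clusters.
    leftover = n_rows - sum(rows)
    for i in range(leftover):
        rows[i] += 1
    return rows


even = allocate_rows(256, [16, 16, 32])
uneven = allocate_rows(10, [1, 1, 1])
```

For a 256-row image on clusters of 16, 16 and 32 processors, the larger cluster receives half of the rows, so each processor in the system ends up with the same number of rows.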

The computation time in terms of number of instructions is given by the following. The total number of
processors is $P$, and we assume all clusters have the same size ($P_c$).

$$t_{cp} = \frac{12 \times N^2 \log_2(N) \times t_{fl}}{P} \qquad (4)$$

where $t_{fl}$ is the number of instructions per floating point operation. The intra-cluster communication time
($t_{cl}$) and the inter-cluster communication time ($t_{icl}$) are given by

$$t_{cl} = \frac{2 \times N^2 (P - 1)}{\ldots} \qquad (5)$$

$$t_{icl} = \frac{4 \times N^2 \times (n - 1) \times P_p \times R}{n \times P} \qquad (6)$$

where $P_p$ is the number of processors per port and $R$ is the communication speed of the network in terms of
number of instructions per word transfer.
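The computation and inter-cluster communication terms can be evaluated with a few lines of Python. The sketch below is an approximation under stated assumptions: the intra-cluster term is left out, conflicts are ignored, and the parameter values in the example are illustrative rather than those of Table 1.

```python
import math


def fft_times(n_side, p, n_clusters, p_port, r, t_fl):
    """Computation and inter-cluster communication times, in instructions."""
    t_cp = 12 * n_side ** 2 * math.log2(n_side) * t_fl / p
    t_icl = 4 * n_side ** 2 * (n_clusters - 1) * p_port * r / (n_clusters * p)
    return t_cp, t_icl


def approx_speedup(n_side, p, n_clusters, p_port, r, t_fl):
    """Best-case speedup with the intra-cluster term omitted (an assumption)."""
    t_seq, _ = fft_times(n_side, 1, 1, p_port, r, t_fl)
    t_cp, t_icl = fft_times(n_side, p, n_clusters, p_port, r, t_fl)
    return t_seq / (t_cp + t_icl)


# Illustrative values: 256x256 image, two clusters of 16 processors.
t_cp, t_icl = fft_times(256, 32, 2, 1, 1, 1)
fast_net = approx_speedup(256, 32, 2, 1, 1, 1)
slow_net = approx_speedup(256, 32, 2, 1, 10, 1)
```

Raising `r` (a slower network relative to the processors) lowers the speedup, which is the trend examined in the bandwidth sensitivity study.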


Using these parameters for 2-D FFT traffic intensity, computation times, and the parameters from Table 1, we
evaluate the performance using the analysis presented earlier. Figure 18 shows the speedup obtained for the 2-D
FFT algorithm. The X-axis shows the number of processors (cluster size is 16). For example, the value 48 means
that the algorithm is executed on 3 clusters, each containing 16 processors. The four graphs in the figure show
speedups for the no conflict (best case), low conflict, medium conflict and high conflict cases through the global
interconnection network (multistage interconnection). We will present similar results for the case in which a bus is
used as the global interconnection network later in this section. We observe that the speedup obtained under varying
degrees of conflict through the network is comparable to that obtained in the best case. However, the best-case
speedup itself is not linear because of the delays through the network and the global memory.
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Figure 18 : Speedup for 2-D FFT (Multistage Network) 

Figure 19 shows the computation and communication time for 2-D FFT as a function of the number of
processors. Figure 20 shows an enlarged graph of the communication times. The communication time is much
smaller than the computation time. Furthermore, the communication time also decreases as the number of
processors (clusters) increases. Also note that the intra-cluster communication time is much smaller than the
inter-cluster communication time. Figure 21 shows the percentage degradation in speedup, as defined in
Equation (3), for different levels of conflict in the network. The degradation in the speedup levels off after
increasing initially because the communication time decreases as the number of processors increases.

Figure 22 shows the sensitivity of the speedup to the network bandwidth. The network bandwidth is normalized
to the computation speed. For example, the value 1 on the X-axis means that it takes the same amount of time
(amortized or in block transfer mode) to write/read a word to/from global memory as it takes to execute one
instruction. The region to the left of 1 indicates a faster communication network, and the region to the right of 1
indicates a slower communication network. It is evident from the figure that the speedup degrades very quickly as
the communication becomes slower. Therefore, in order to obtain any significant speedup from parallel
computation, it is important to have matched computation and communication speeds; otherwise, increasing the
number of processors or the processor speeds will not improve the performance as expected. Figure 22 also
illustrates that the four graphs diverge as the communication becomes slower, meaning that performance under
heavy traffic suffers more on a slower network than performance under light traffic.

Figure 19 : Computation and Communication Times for 2-D FFT (Multistage Network)

Figure 20 : Communication Times for 2-D FFT (Multistage Network)

Figure 21 : Degradation in Speedup Due to Conflicts for 2-D FFT (Multistage Network)
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Figure 22 : Speedup vs. Network Speed 
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The following is a discussion of the performance of 2D-FFT when a bus is used as a global interconnection 
network. The algorithm is mapped as described above. Since the global bus can be accessed by only one processor at
a time, the inter-cluster communication time becomes additive as the number of clusters is increased. Therefore, the
performance is expected to be worse than in the case of the multistage interconnection network. The total
computation time remains the same as in the previous case and is given by

$$t_{cp} = \frac{12 \times N^2 \log_2(N) \times t_{fl}}{P}$$

However, the inter-cluster communication time becomes

$$t_{icl} = \sum_{i=1}^{n-1} \frac{2 \times R \times N^2}{n^2} = \frac{2 \times R \times (n - 1) \times N^2}{n^2}$$

In other words, each cluster needs to send a fraction $\frac{n-1}{n}$ of its data to transpose the intermediate
results. This is achieved by a designated processor in each cluster, which collects the data and broadcasts it on the
bus to be read by the processors of other clusters. Hence, there is an additional overhead of collecting and
distributing the intermediate data. The intra-cluster communication time in this case is given by

$$t_{cl} = t_{cl1} + t_{cl2} + t_{cl3}$$

where

$$t_{cl1} = \frac{2 \times (P_c - 1) \times N^2}{P^2 \times n}$$

for the within-cluster transpose, and

$$t_{cl2} = t_{cl3} = \frac{\ldots}{n^2}$$

for sending, receiving and redistributing the intermediate data.

Figure 23 : Speedup for 2-D FFT (Global Bus)
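The additive nature of the bus communication can be checked with a short Python sketch that evaluates the inter-cluster time both as a sum of equal terms and in closed form. The example values are illustrative, and the function name is not from the paper.

```python
def bus_inter_cluster_time(n_side, n_clusters, r):
    """Inter-cluster time on the bus, as a sum and in closed form."""
    per_term = 2 * r * n_side ** 2 / n_clusters ** 2
    summed = sum(per_term for _ in range(1, n_clusters))   # i = 1 .. n - 1
    closed = 2 * r * (n_clusters - 1) * n_side ** 2 / n_clusters ** 2
    return summed, closed


summed, closed = bus_inter_cluster_time(n_side=256, n_clusters=4, r=1)
single = bus_inter_cluster_time(n_side=256, n_clusters=1, r=1)
```

With a single cluster the term vanishes, since no data has to cross the bus.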

Using these parameters, we evaluate the performance of 2-D FFT under varying degrees of conflict on the
bus. Figure 23 shows the speedup for 2-D FFT as a function of the number of processors (cluster size 16). When
there is no conflict on the bus, the speedup increases with the number of processors. However, under conflicts, the
speedup first decreases and then increases slowly. In fact, for medium and high conflicts, the speedup obtained on
one cluster is better than that obtained using multiple clusters. The reason for such poor performance is that, even
though the communication is decomposable in 2-D FFT, the inter-cluster communication time becomes additive
because of the bus and increases with the number of clusters executing the algorithm, as shown in Figure 24.

Figure 25 shows the relative performance degradation in the speedup. The degradation is very significant.
However, the degradation itself decreases as the number of processors (clusters) increases because more clusters
execute the algorithm and, consequently, fewer clusters interfere. Figure 26 shows the sensitivity of the
speedup to the bus speed. Again, the figure shows that performance degrades rapidly as the bus becomes slower. In
order for a bus to be a viable global interconnection network, it is essential that the bus bandwidth be much greater
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Figure 24 : Computation and Communication Times for 2-D FFT (Global Bus) 
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Figure 25 : Degradation in Speedup Due to Conflicts 2-D FFT (Global Bus) 
4.2. Separable Convolution

This algorithm uses two one-dimensional masks and consists of two steps: convolution along the rows, followed by
convolution along the columns of the intermediate results. Partitioning along rows across clusters therefore avoids
communication in the first step. However, before the second step can be performed, the boundary rows of each
cluster need to be communicated to other clusters. Figure 27 shows the mapping on three clusters. Note that, unlike
in 2-D FFT, a cluster needs to communicate with at most two other clusters to obtain the upper and lower boundary
rows of the intermediate results. The number of rows to be exchanged depends on the kernel size. For a kernel size
of $w \times w$, the number of rows to be exchanged along each direction is $\frac{w}{2}$. The amount of
communication is fixed and is independent of the number of clusters on which the algorithm is mapped. The same
mapping will work for regular 2-D convolution, except that the amount of computation per pixel will be larger.
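The two-step structure can be sketched in Python as a row pass followed by a column pass. The sketch assumes zero padding outside the image and applies the masks without flipping them (correlation form); neither choice is specified in the paper, and the function names are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def filter_rows(img, mask):
    """Apply a 1-D mask along every row; pixels outside the image count as zero."""
    r, width = len(mask) // 2, len(img[0])
    return [[sum(mask[j] * row[x + j - r]
                 for j in range(len(mask)) if 0 <= x + j - r < width)
             for x in range(width)]
            for row in img]


def separable_filter(img, row_mask, col_mask):
    step1 = filter_rows(img, row_mask)                          # along rows
    return transpose(filter_rows(transpose(step1), col_mask))   # along columns


def full_filter(img, row_mask, col_mask):
    """Direct 2-D filtering with the outer-product kernel, for comparison."""
    r_y, r_x = len(col_mask) // 2, len(row_mask) // 2
    h, w = len(img), len(img[0])
    return [[sum(col_mask[i] * row_mask[j] * img[y + i - r_y][x + j - r_x]
                 for i in range(len(col_mask)) for j in range(len(row_mask))
                 if 0 <= y + i - r_y < h and 0 <= x + j - r_x < w)
             for x in range(w)] for y in range(h)]


img = [[(5 * y + x * x) % 9 for x in range(6)] for y in range(5)]
two_pass = separable_filter(img, [1, 2, 1], [1, 0, -1])
direct = full_filter(img, [1, 2, 1], [1, 0, -1])
```

The row pass touches only data within a row, so with row partitioning it needs no communication; the column pass is the step that needs rows held by neighbouring partitions.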

The computation time for the two steps is given by

$$t_{cp1} = t_{cp2} = \frac{2 \times t_{fl} \times N \times (\ldots + 1)}{P} \qquad (9)$$

the intra-cluster communication is given by

$$t_{cl} = 2 \times N \times w \qquad (10)$$

and the inter-cluster communication is given by

$$t_{icl} = 2 \times w \times N \times R$$

Figure 28 depicts the speedup obtained for the separable convolution algorithm as a function of the number of
clusters (cluster size = 16). The speedup increases sublinearly as the number of clusters increases. The reason for
not obtaining better speedup is that the computation per point of the input is small, and the computation per
processor decreases as the number of clusters increases, but the communication remains constant (as long as the
granularity per processor is at least $\frac{w}{2}$ rows). Hence, the ratio of computation to communication
decreases as the number of processors increases. The computation and communication times are shown in
Figures 29 and 30. Figure 29 compares the two times, whereas Figure 30 shows only the communication time.

Figure 27 : An Example of Mapping Separable Convolution on Three Clusters


Note that inter-cluster communication can be avoided completely if clusters are assigned overlapped rows to
perform the first step. That is, if a cluster is responsible for computing the convolution for $R_i$ rows, then it is
assigned $w + R_i$ rows. Each cluster therefore has to perform additional computation to obtain the 1-D
convolution of $w$ additional rows. If the extra computation time is less than the communication time, then
overlapped data partitioning is better.

Figure 31 shows a performance comparison of the two partitioning methods. When the number of processors
executing an algorithm is small, the performance is almost the same. For smaller window sizes the difference is
marginal and becomes apparent only when the number of processors becomes large. However, as the window size
increases (40x40 in Figure 31), the performance with overlapped computation becomes poor because the overhead
of extra computation becomes larger than the communication overhead.

Figure 28 : Speedup for Separable Convolution (Multistage Network)
Separable Convolution (Multistage Network)

Figure 32 shows the performance of the algorithm when the bus is used as the global interconnection network.
The speedup increases as the number of clusters increases but eventually levels off. Though the inter-cluster
communication time per cluster is constant, the total communication time increases as the number of clusters
increases, because only one cluster can send data on the bus at any time. This is illustrated in Figure 33, where the
communication time (with no interference) is a linear function of the number of clusters. Another reason for the
speedup to level off is that, for a larger number of clusters, the computation time becomes comparable to or smaller
than the communication time.


Figure 31 : Comparison of Performance between Overlapped Computation and Communication for Separable Convolution
(The box on the left graph has been blown up in the right graph)

Figure 32 : Speedup for Separable Convolution (Global Bus)

Figure 33 : Computation and Communication Times for Separable Convolution (Global Bus)


4.3. Hough Transform

We have evaluated two mappings for the Hough transform, namely Data Partitioning (DP) and Parameter
Partitioning (PP). The difference between the two mappings is described in Section 2. Briefly, in DP the data is
decomposed among clusters, and in PP the parameters are decomposed across clusters. In DP, data is allocated to
clusters in proportion to their size. Within a cluster, data is distributed equally among the processors. The algorithm
consists of three phases. In the first phase, each processor computes and accumulates the count contributed by its
data for all the parameter values. Note that each processor maintains the entire accumulator array. In the second
phase, partial results are combined within a cluster, i.e., all the accumulator arrays are added together, and then a
designated processor from each cluster writes the accumulator array to designated memory locations. Arrays from
all the clusters participating in the algorithm execution are then collected by one cluster. In the third phase, the
cluster holding the entire accumulator array computes the local maxima.
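The three phases of data partitioning can be sketched in Python as follows. This is a simplified illustration: a loop stands in for the processors, the accumulator quantization is an assumption, and the third phase returns only the single global maximum instead of all local maxima.

```python
import math


def accumulate(pixels, n_theta, n_rho, rho_max):
    """Build a full accumulator array from the given pixels."""
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in pixels:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int(round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)))
            acc[t][r] += 1
    return acc


def data_partitioned_hough(pixels, n_proc, n_theta, n_rho, rho_max):
    # Phase 1: each worker votes over the whole parameter space using only
    # its own share of the pixels.
    partial = [accumulate(pixels[p::n_proc], n_theta, n_rho, rho_max)
               for p in range(n_proc)]
    # Phase 2: the partial accumulator arrays are added together.
    total = [[sum(a[t][r] for a in partial) for r in range(n_rho)]
             for t in range(n_theta)]
    # Phase 3 (simplified): the holder of the combined array finds the peak.
    peak = max((total[t][r], t, r)
               for t in range(n_theta) for r in range(n_rho))
    return total, peak


line = [(i, i) for i in range(6)]            # six pixels on the line y = x
total, peak = data_partitioned_hough(line, 3, 8, 17, 10.0)
reference = accumulate(line, 8, 17, 10.0)
```

The combining step in the second phase is the one that grows with the number of participants, which is the overhead that parameter partitioning avoids.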

Under PP, each cluster is assigned the entire input data but only a part of the parameter space. The
parameter space is partitioned in proportion to the cluster size. Each cluster receives two more parameters
(boundary values) so that inter-cluster communication is avoided. That is, each cluster performs a fixed amount of
additional computation to avoid inter-cluster communication. Within a cluster, however, data is distributed equally
among the processors, and all processors work on the entire allocated parameter space. Dividing the parameter
space results in mutually exclusive accumulator arrays across clusters, and therefore there is no need for
inter-cluster communication to compute the local maxima.

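For DP, the computation and communication times for the various phases are as follows: $t_{cp1}$ is for
computing the accumulator count, $t_{cp2}$ is for combining partial accumulator arrays within a cluster,
$t_{cp3}$ is for computing the final accumulator array, and $t_{cp4}$ is the time to compute the local maxima
by one cluster.

$$t_{cp1} = \frac{3 \times t_{fl} \times N^2 \times \theta_c}{P} \qquad (12)$$

$$t_{cp2} = \rho_c \times \theta_c \times \log_2 P_c \qquad (13)$$

$$t_{cp3} = \frac{(n - 1) \times \rho_c \times \theta_c}{P_c} \qquad (14)$$

$$t_{cp4} = \frac{3 \times \rho_c \times \theta_c}{P_c} \qquad (15)$$

Intra-cluster and inter-cluster communication times are given by

$$t_{cl} = (\log_2 P_c + 1) \times \rho_c \times \theta_c \qquad (16)$$

$$t_{icl} = \frac{n \times R \times P_p \times \theta_c \times \rho_c}{P_c} \qquad (17)$$

Similarly, the corresponding computation and communication times for PP are given by

$$t_{cp1} = \frac{3 \times t_{fl} \times N^2 \times \left(\frac{\theta_c}{n} + 2\right)}{P_c} \qquad (18)$$

$$t_{cp2} = \log_2 P_c \times \rho_c \times \left(\frac{\theta_c}{n} + 2\right) \qquad (19)$$

$$t_{cp3} = \frac{3 \times \rho_c \times \theta_c}{n \times P_c} \qquad (20)$$

$$t_{cl} = (\log_2 P_c + 1) \times \left(\frac{\theta_c}{n} + 2\right) \times \rho_c \qquad (21)$$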

Figure 34 depicts the speedups for the Hough transform using the two partitioning methods. Because the
communication overhead through global memory increases linearly with the number of clusters, the speedup for DP
levels off. Figure 35 shows the computation and communication times for the Hough transform in the DP case,
whereas Figure 36 shows the communication overhead in detail. Data partitioning does not perform as well as
parameter partitioning. However, the degradation with respect to the best-case speedup in DP is small. As we can
observe, good speedup can be obtained for a global data dependent algorithm like the Hough transform.

Figure 37 shows the speedup for the Hough transform (DP), and Figure 38 depicts the communication and
computation times, when the bus is used as the global interconnection network. Note that the performance of the
Hough transform under PP will be the same in both cases because there is no global communication.
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Figure 34 : Speedup for Hough Transform (Multistage Network) 




Figure 35 : Computation and Communication Times for Hough Transform (Multistage Network) 
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Figure 36 : Communication Times for Hough Transform (Multistage Network) 
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Figure 37 : Speedup for Hough Transform (Global Bus) 
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Figure 38 : Computation and Communication Times Hough Transform (Global Bus) 

5. Summary 

We presented a performance evaluation of NETRA using several vision algorithms. The goal of this paper was
to illustrate the performance of the components of NETRA using algorithms with varying characteristics and
communication requirements. For each algorithm we presented one or more mapping strategies, its performance
evaluation, and a discussion of the results. The algorithms included 2-D FFT, convolution, separable convolution,
Hough transform, Sobel edge detection and median filtering.

To evaluate parallel algorithms on a cluster, we explored alternative mapping strategies and computation
modes. Some of the algorithms were implemented on a simulated cluster, and we showed that the analysis
provides very accurate results. We also discussed the performance of the algorithms when they are mapped across
multiple clusters. The results were used to compare alternative inter-cluster communication strategies, and they
show that it is possible to obtain good performance for algorithms with different characteristics under varying
degrees of conflict in the global interconnection network.

In general, a multistage interconnection network performs much better as the global interconnection than a
global bus, as expected. The parameters chosen for processor speed and communication speed were very
conservative. We think that much faster processors and communication links are available with current
technology, and therefore the performance results presented in this paper are also conservative. However, we
obtained insight into the sensitivity of the performance measures as a function of various architecture parameters.
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